Statistical Inference VI

Chelsea Parlett-Pelleriti

Frequentist Hypothesis Testing

Two Sided Hypotheses

So far, we’ve looked at these plots using a one-sided (directional) null. Let’s look at a two-sided (non-directional) null.

Hypothesis Testing and CIs

Side Note on Jerzy Neyman

Comparison of Fisher vs. NP

  • Role of the p-value/critical-value

    • Fisher: A continuous measure indicating the strength of evidence against H₀.

    • Neyman-Pearson: A decision-making tool that compares results to a threshold.

Comparison of Fisher vs. NP

  • Error Rates

    • Fisher: Does not focus on binary decision making

    • Neyman-Pearson: Controls Type I and Type II errors through \(\alpha\) and \(\beta\)

Comparison of Fisher vs. NP

  • Philosophy

    • Fisher: Inductive inference; aims to measure evidence without making definitive decisions.

    • Neyman-Pearson: Frequentist perspective; focuses on long-run error rates and decision rules.

Frankenstein Approach to Significance Testing

i’m sorry.

NHST

In NHST, we assume \(H_0\) is true and try to provide evidence against it using reductio ad absurdum: we calculate a p-value and use it to decide whether \(H_0\) should be rejected.
If \(H_0\) were true, the data we observed would be absurd (highly unlikely); therefore, we act as if \(H_0\) is false.

NHST

Modern NHST mixes the Fisherian and Neyman-Pearson frameworks where p-values are used as a binary decision making tool, but are also often treated as continuous measures of evidence at the same time. This leads to misuse and misconceptions.

\[ p \lt 0.05 \]

P-value Myths

Review: \(p = P(\text{test statistic at least as extreme as observed} \mid H_0)\)

  • p-values are not the probability the null is true

  • p-values are not the probability the effect will replicate

  • non-significant p-values do not mean that the null is true

P-hacking

Family Wise Error Rates

❓If I use an \(\alpha = 0.05\), and I run 20 tests, what is the probability that I get at least one significant p-value?

Family Wise Error Rates

❓If I use an \(\alpha = 0.05\), and I run 20 tests, and the null is true, what is the probability that I get at least one significant p-value out of these 20 tests?

\[ \text{FWER} = 1 - \underbrace{0.95^{20}}_{\text{P(all 20 non-significant)}} \approx 64\% \]
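This calculation is easy to verify directly; a minimal sketch in R, assuming (as the slide does) 20 independent tests with all nulls true:

```r
# FWER when running m independent tests at alpha = 0.05, all nulls true
alpha <- 0.05
m <- 20
fwer <- 1 - (1 - alpha)^m  # 1 - P(no false positives in m tests)
fwer                       # approximately 0.64
```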

Family Wise Error Rates

Family Wise Error Rates

To correct for this, we often use things like a Bonferroni or Sidak correction:

  • Bonferroni: \(p_{thresh} = \frac{\alpha}{m}\); where \(m\) is the number of tests

  • Sidak: \(p_{thresh} = 1 - (1-\alpha)^{\frac{1}{m}}\); where \(m\) is the number of tests
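Both corrected thresholds are one-liners in R; the values below assume \(\alpha = 0.05\) and \(m = 20\):

```r
alpha <- 0.05
m <- 20
p_bonf  <- alpha / m                 # Bonferroni threshold: 0.0025
p_sidak <- 1 - (1 - alpha)^(1 / m)   # Sidak threshold: ~0.00256
```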

Exercise for the listener: What is the FWER using these new thresholds? How does it change as \(m \to \infty\) ?

Equivalence Testing

Remember, non-significant p-values do not mean that the null is true

Claim: there are no black swans.

from: Wikipedia

from: Wikipedia

Equivalence Testing

Equivalence Testing

from: Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses

from: Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses

Equivalence Testing in R

# is the mean anxiety rating practically equivalent to 0?
library(TOSTER)

test <- tsum_TOST(m1 = mean(anxiety),
                  mu = 0,
                  sd1 = sd(anxiety),
                  n1 = length(anxiety),
                  low_eqbound = -0.1 * sd(anxiety),
                  high_eqbound = 0.1 * sd(anxiety))

test$decision
$TOST
[1] "The equivalence test was significant, t(999) = -2.835, p = 2.34e-03"

$ttest
[1] "The null hypothesis test was non-significant, t(999) = 0.327, p = 7.44e-01"

$combined
[1] "NHST: don't reject null significance hypothesis that the effect is equal to zero \nTOST: reject null equivalence hypothesis"

Equivalence Testing in R

plot(test, type = "cd")

Quick Review

Neyman-Pearson Hypothesis Testing

Note: we choose our cutoff so that only \(\alpha\) of our null sampling distribution is more extreme than our cutoff. Thus, we choose the Type I Error Rate (when sampling from the null, how often will we get a value more extreme than our cutoff?)
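For a standard normal null distribution (assumed here for illustration; other tests use their own null distributions), the cutoff for a given \(\alpha\) comes straight from qnorm():

```r
alpha <- 0.05
cut_one_sided <- qnorm(1 - alpha)      # one-sided cutoff: ~1.645
cut_two_sided <- qnorm(1 - alpha / 2)  # two-sided cutoff: ~1.96
```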

Power Analysis

If there is an effect, how likely are you to detect it (\(1 - \beta\), the power; \(\beta\) is the Type II error rate)?

|               | Fail to Reject \(H_0\)      | Reject \(H_0\)           |
|---------------|-----------------------------|--------------------------|
| \(H_0\) True  | \(1-\alpha\) Correct        | \(\alpha\) Type I Error  |
| \(H_1\) True  | \(\beta\) Type II Error     | \(1-\beta\) Power        |

Power Analysis

Power Analysis

❓ What are things we could change that would increase our statistical power?

Power Analysis

  • sample size

  • population standard deviation

  • effect size

  • \(\alpha\)
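Base R's power.t.test() (from the stats package) shows the effect of these quantities directly for a two-sample t-test; the delta (effect size) and sd values below are arbitrary illustrations:

```r
# power at two sample sizes, holding effect size and sd fixed
pow_small_n <- power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
pow_large_n <- power.t.test(n = 80, delta = 0.5, sd = 1, sig.level = 0.05)$power
# larger n gives more power; we can also solve for the n per group
# needed to reach 80% power:
n_needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)$n
```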

Power Analysis: \(n\)

Power Analysis: \(\alpha\)

Power Analysis: Effect Size

Power Analysis: \(\sigma\)

Bayesian Hypothesis Testing

Bayesian Parameter Estimation Review

Bayesianism's main ideas:

  1. data \(X\) is fixed, and the parameters \(\theta\) of our process \(P_{\theta}\) are random

    • we imagine different parameter values that could exist

  2. inference relies on the idea of updating prior beliefs based on evidence from the data

  3. probabilities are used to quantify uncertainty we have about parameters

\[ \underbrace{p(\theta|d)}_\text{posterior} = \underbrace{\frac{p(d|\theta)}{p(d)}}_\text{update} \times \underbrace{p(\theta)}_\text{prior} \]
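A minimal grid-approximation sketch of this updating rule, using hypothetical coin-flip data (7 heads in 10 flips) and a flat prior:

```r
# posterior for a coin's heads probability theta via grid approximation
theta      <- seq(0, 1, length.out = 1001)
prior      <- rep(1, length(theta))               # p(theta): flat prior
likelihood <- dbinom(7, size = 10, prob = theta)  # p(d | theta)
posterior  <- likelihood * prior
posterior  <- posterior / sum(posterior)          # normalize: divide by p(d)
post_mode  <- theta[which.max(posterior)]         # posterior mode: 0.7
```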

Bayesian Parameter Estimation Review

Bayesian Parameter Estimation Review

In Bayesian Parameter Estimation we use the Posterior to create summaries that tell us the plausibility of different values for \(\theta\).

In Bayesian Hypothesis Testing we use the Posterior (or parts of it) to calculate support for a particular theory/hypothesis

Bayesian Hypothesis Testing

There are many ways to test hypotheses in a Bayesian Framework. Three main ones:

  1. Check if a Credible Interval overlaps with value(s) of interest

  2. Posterior Odds

  3. Bayes Factors (Likelihood odds)

Bayesian Hypothesis Testing: CI

\(H_0: \theta = 0\); is \(0\) in our credible interval?

Bayesian Hypothesis Testing: PO

\(H_0: \theta \leq 0\); \(H_A: \theta \gt 0\)

Posterior Odds: \(\frac{P(H_A \mid \text{data})}{P(H_0 \mid \text{data})} = 3.83\)

Bayesian Hypothesis Testing: PO

\(H_0: \theta \leq 0\); \(H_A: \theta \gt 0\)

Posterior Odds: \(\frac{P(H_A \mid \text{data})}{P(H_0 \mid \text{data})} = 3.83\)

Interpretation: the Alternative Hypothesis is \(\approx 4\) times more likely than the Null Hypothesis
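Given posterior draws of \(\theta\) (e.g., from MCMC), these posterior odds are just a ratio of proportions. The draws below are simulated stand-ins, so the resulting odds are illustrative rather than the slide's 3.83:

```r
set.seed(1)
draws <- rnorm(1e5, mean = 0.5, sd = 1)  # stand-in for MCMC draws of theta
p_HA  <- mean(draws > 0)    # P(H_A | data): theta > 0
p_H0  <- mean(draws <= 0)   # P(H_0 | data): theta <= 0
po    <- p_HA / p_H0        # posterior odds
```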

Bayesian Hypothesis Testing: PO

\(H_0: \theta \leq 0\); \(H_A: \theta \gt 0\)

Posterior Odds: \(\frac{P(H_A \mid \text{data})}{P(H_0 \mid \text{data})} = 0.79\)

Bayesian Hypothesis Testing: PO

\(H_0: \theta \leq 0\); \(H_A: \theta \gt 0\)

Posterior Odds: \(\frac{P(H_A \mid \text{data})}{P(H_0 \mid \text{data})} = 0.2642\)

Interpretation: the Alternative Hypothesis is \(\approx 0.2642\) times more likely than the Null Hypothesis; the Null Hypothesis is \(\approx \frac{1}{0.2642} = 3.785\) times more likely than the Alternative

❗We just provided evidence for the Null

Bayesian Hypothesis Testing: PO

We want to know whether Treatment has an effect on number of seizures.

library(brms)
# model with the treatment effect
fit1 <- brm(
  count ~ zAge + zBase + Trt,
  data = epilepsy, family = negbinomial(),
  prior = prior(normal(0, 1), class = b),
  save_pars = save_pars(all = TRUE),
  silent = 2, refresh = 0
)

# model without the treatment effect
fit2 <- brm(
  count ~ zAge + zBase,
  data = epilepsy, family = negbinomial(),
  prior = prior(normal(0, 1), class = b),
  save_pars = save_pars(all = TRUE),
  silent = 2, refresh = 0
)

# compute posterior model probabilities (with equal prior model probabilities)
post_prob(fit1,
          fit2,
          prior_prob = c(0.5, 0.5))
#      fit1      fit2 
# 0.3501424 0.6498576

Bayesian Hypothesis Testing: BF

  • Posterior odds compare the posterior probabilities of two hypotheses (after seeing data)

  • Prior odds compare the prior probabilities of two hypotheses (before seeing data)

  • Bayes Factors represent how much the data changes the prior odds when transforming to posterior odds

\[ \underbrace{\frac{p(H_A \mid \text{data})}{p(H_0 \mid \text{data})}}_\text{Posterior Odds} = \underbrace{\frac{p(\text{data} \mid H_A)}{p(\text{data} \mid H_0)}}_\text{Bayes Factor} \cdot \underbrace{\frac{p(H_A)}{p(H_0)}}_\text{Prior Odds} \]
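A quick numeric sanity check of this identity, with hypothetical numbers; note that with equal prior probabilities the posterior odds and the Bayes Factor coincide:

```r
prior_odds     <- 0.5 / 0.5   # equal prior probabilities for H_A and H_0
bf             <- 3.83        # hypothetical Bayes factor
posterior_odds <- bf * prior_odds
posterior_odds                # with equal priors, posterior odds equal the BF
```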

Bayesian Hypothesis Testing: BF

  • \(BF \lt 1\): seeing the data made \(H_A\) less plausible

  • \(BF = 1\): seeing the data did not change the relative plausibility of \(H_A\)

  • \(BF > 1\): seeing the data made \(H_A\) more plausible

Bayesian Hypothesis Testing: BF

library(brms)
# model with the treatment effect
fit1 <- brm(
  count ~ zAge + zBase + Trt,
  data = epilepsy, family = negbinomial(),
  prior = prior(normal(0, 1), class = b),
  save_pars = save_pars(all = TRUE),
  silent = 2, refresh = 0
)

# model without the treatment effect
fit2 <- brm(
  count ~ zAge + zBase,
  data = epilepsy, family = negbinomial(),
  prior = prior(normal(0, 1), class = b),
  save_pars = save_pars(all = TRUE),
  silent = 2, refresh = 0
)

# compute the bayes factor
bayes_factor(fit1, fit2)

Bayesian Interval Estimates: ROPE

Region of Practical Equivalence: an interval/range of values that are practically equivalent to no effect.

  • any change in depression scores \(\pm 0.25\) is clinically irrelevant
  • any click rate between \([0.01, 0.03]\) is practically equivalent to \(0.02\)
  • any regression coefficient that is \(0 \pm \frac{1}{10} sd\) means there’s no effect of that predictor

Smallest Effect Size of Interest: the smallest effect size that would be meaningful, clinically relevant, or impactful.

Bayesian Interval Estimates: ROPE

  1. Define a ROPE (use domain expertise or a “standard” small value like \(\frac{1}{10} sd\) )

  2. Calculate what % of your Posterior CI overlaps with ROPE

If there is a lot of overlap, that is evidence for practical equivalence; if there is little overlap, that is evidence for non-equivalence.
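A sketch of step 2, using simulated draws as a stand-in for a real posterior; the draws, the equal-tailed 95% CI, and the ROPE bounds are all hypothetical:

```r
set.seed(2)
draws <- rnorm(1e5, mean = 0.02, sd = 0.05)      # stand-in for MCMC draws
rope  <- c(-0.1, 0.1)                            # hypothetical ROPE
ci    <- unname(quantile(draws, c(0.025, 0.975)))  # 95% credible interval
# fraction of the CI's width that falls inside the ROPE
overlap <- max(0, min(ci[2], rope[2]) - max(ci[1], rope[1])) / (ci[2] - ci[1])
```

Here most of the CI sits inside the ROPE, which would count as evidence for practical equivalence.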

Bayesian Interval Estimates: ROPE